
The dataset selected for the condensed network proof of concept was HLA-A*02:02 peptide binding data (1447 peptides in total, of which 649 were binders). A simple hierarchical clustering algorithm grouped peptides sharing more than 65\% sequence identity, yielding 1158 clusters. 
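The clustering step can be illustrated with a minimal single-linkage sketch in pure Python, assuming pairwise sequence identity over equal-length peptides as the similarity measure; the function names are illustrative, not the actual implementation used.

```python
def identity(p, q):
    """Fraction of identical residues between two equal-length peptides."""
    return sum(a == b for a, b in zip(p, q)) / len(p)

def cluster_peptides(peptides, threshold=0.65):
    """Single-linkage clustering via union-find: peptides sharing more than
    `threshold` identity with any cluster member end up in the same cluster."""
    parent = list(range(len(peptides)))

    def find(i):
        # Follow parent pointers to the cluster root, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(peptides)):
        for j in range(i + 1, len(peptides)):
            if identity(peptides[i], peptides[j]) > threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(peptides)):
        clusters.setdefault(find(i), []).append(peptides[i])
    return list(clusters.values())

# Toy usage: the first two 9-mers share 8/9 positions and cluster together.
print(len(cluster_peptides(["ALADGVQKV", "ALADGVQRV", "GLYDGMEHL"])))
```

Grouping similar peptides before splitting the data prevents near-identical sequences from appearing in both training and evaluation sets, which would otherwise inflate the measured performance.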

For each of the five holdouts with independent evaluation data, a set of 1200 neural networks was generated as described. The effect of the various parameters and data schemes was evaluated for each network. Figure \ref{fig:neurons_effect} shows the effect of the number of hidden neurons on the evaluation performance for the five holdouts. Networks trained with 1 hidden neuron generally perform worse than those with more. Using 3 hidden neurons also slightly improves the performance compared to 2, although this tendency was only observed for 3 of the 5 holdouts. Overall, the choice of evaluation set has a substantial influence on performance, as indicated by the differences between holdouts 1--5. 
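The effect of network size can be made concrete by counting free parameters. The following is a minimal sketch, assuming 9-mer peptides with 20 input values per position (as in sparse encoding), a single hidden layer, and one output neuron; the exact architecture details are assumptions for illustration.

```python
def n_params(n_in, n_hidden, n_out=1):
    """Number of weights and biases in a single-hidden-layer feed-forward
    network: (n_in + 1 bias) per hidden neuron, (n_hidden + 1 bias) per output."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

# Assuming 9-mer peptides encoded as 9 * 20 = 180 inputs:
for h in (1, 2, 3):
    print(h, "hidden neurons:", n_params(9 * 20, h), "parameters")
```

Even with a single hidden neuron the network has close to two hundred parameters, and each extra hidden neuron adds roughly as many again, which is one reason small changes in hidden-layer size can matter on a dataset of this size.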

\begin{figure}[!tpb]
\centerline{\includegraphics[trim=0cm 0cm 0cm 0cm, clip=true, width=1.00\linewidth]{Graphics/neurons_evalcor.pdf}}
\caption{PCC between predictions and evaluation data of 1200 networks for each of the five holdouts using different numbers of hidden neurons.}
\label{fig:neurons_effect}
\end{figure}

Figure \ref{fig:encoding_evalcor} shows the effect of the input encoding scheme on the performance. For 4 of the 5 holdouts, sparse encoding yielded the highest PCC between predictions and evaluation data. The encoding scheme has a considerably larger effect on the evaluation performance than the number of hidden neurons. 
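As an illustration of the input representation, the following is a minimal sketch of the sparse (one-hot) encoding, assuming the standard 20-letter amino acid alphabet; BLOSUM encoding would instead substitute each residue's row of a (scaled) BLOSUM substitution matrix, so related residues receive similar input vectors.

```python
AA = "ACDEFGHIKLMNPQRSTVWY"  # standard 20 amino acids, alphabetical

def sparse_encode(peptide):
    """Sparse (one-hot) encoding: each residue becomes a 20-dimensional
    indicator vector, so a 9-mer maps to a 180-dimensional input vector."""
    vec = []
    for aa in peptide:
        vec.extend(1.0 if aa == a else 0.0 for a in AA)
    return vec

# Toy usage: a 9-mer yields 180 inputs, exactly 9 of which are non-zero.
v = sparse_encode("ALADGVQKV")
print(len(v), sum(v))
```

Sparse encoding treats every amino acid as equally different from every other; the BLOSUM alternative injects prior knowledge about substitution frequencies, which can help or hurt depending on the dataset.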

\begin{figure}[!tpb]
\centerline{\includegraphics[trim=0cm 0cm 0cm 0cm, clip=true, width=1.00\linewidth]{Graphics/encoding_evalcor.pdf}}
\caption{PCC between predictions and evaluation data of 1200 networks for each of the five holdouts using BLOSUM and sparse input encoding.}\label{fig:encoding_evalcor}
\end{figure}

The PCC of the evaluation set predictions was plotted against the PCC of the test set predictions for all neural networks to assess whether good networks could be selected based on test set performance. The result (Figure \ref{fig:test_eval_cor}) shows almost no correlation between the test and evaluation performance of the trained networks. Furthermore, the analysis was repeated on all 12 subsets for every holdout to see the effect of dividing the training and test data over several cycles (shown in different colours). This analysis reveals a significant bias in test set PCC for the random subsets, visible in holdouts 4 and 5. However, no corresponding bias is observed in the final evaluation PCC of the neural networks trained with these subsets. 
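The PCC used throughout is the standard Pearson correlation coefficient, sketched below in pure Python; given per-network lists of test and evaluation PCCs, the same function also quantifies the (near-zero) correlation between the two axes of Figure \ref{fig:test_eval_cor}. The function name is illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences:
    covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy usage: perfectly correlated and anti-correlated sequences.
print(pearson([1, 2, 3], [2, 4, 6]))   # 1.0
print(pearson([1, 2, 3], [3, 2, 1]))   # -1.0
```

A value near zero for `pearson(test_pccs, eval_pccs)` is precisely what makes test-set performance an unreliable criterion for selecting networks here.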

\begin{figure}[!tpb]
\centerline{\includegraphics[trim=0.3cm 0cm 0.3cm 0cm, clip=true, width=0.90\linewidth]{Graphics/test_eval_cor.pdf}}
\caption{PCC of the evaluation set plotted against PCC of the test set for each of the five holdouts. The different colours represent the 12 different subsets.}\label{fig:test_eval_cor}
\end{figure}

The performance of the ensemble was found to be high relative to the average performance of the individual networks (Figure \ref{fig:correlation_density}, green dots). The EasyPred neural network prediction tool was used as a reference for the ensemble performance. Using 4-fold cross-validation on the combined training and test set for each holdout yielded an average difference in PCC of 0.0069. 
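A minimal sketch of the ensemble prediction, assuming the ensemble output is the plain average of the individual network outputs (the combination rule and names are illustrative; the networks are stand-in callables here):

```python
def ensemble_predict(networks, peptide):
    """Ensemble output: the mean of the individual network predictions.
    `networks` is any list of callables mapping an input to a score."""
    scores = [net(peptide) for net in networks]
    return sum(scores) / len(scores)

# Toy usage with three stand-in 'networks' (constant functions):
nets = [lambda x: 0.6, lambda x: 0.8, lambda x: 0.7]
print(ensemble_predict(nets, "ALADGVQKV"))  # approximately 0.7
```

Averaging many networks tends to cancel out the individual models' uncorrelated errors, which is consistent with the ensemble PCC lying above the bulk of the single-network density in Figure \ref{fig:correlation_density}.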

\begin{figure}[!tpb]
\centerline{\includegraphics[trim=0cm 0cm 0cm 1cm, clip=true, width=1.00\linewidth]{Graphics/correlation_density.pdf}}
\caption{Density plots of the PCC scores for the 1200 networks in each holdout. The coloured dots represent the PCC scores for the ensemble, condensed networks and EasyPred networks.}
\label{fig:correlation_density}
\end{figure}

